Abstract
This is a brief presentation of the methodology used for the CODA19 phenotyper project and some of the preliminary results. N.B. Those results do not include the imaging portion of the project and only includes the data from the CHUM.
The objective of this project was to identify clinical phenotypes in patients with COVID-19, assess their interaction with the response to specific supportive management strategies, and evaluate their association with outcomes.
In this document, we will successively describe the different steps that were undertook :
The CODA19 database was used for the project. Data for the following variables were available for analysis :
# This chunk of code queries the CODA19 database and runs SQL scripts extracting the aforementionned data
coda19 <- DBI::dbConnect(RSQLite::SQLite(), "../py_eyamga/covidb_version-1.0.0.db")
files_path <- list.files(path = "./sql", full.names=TRUE)
files_names <- as_tibble(str_split_fixed(list.files(path = "./sql"), pattern =".sql", n=2))[[1]]
dflist = list()
for (i in seq_along(files_path)){
#dbGetQuery(coda19, statement = read_file(files_path[i]))
tmp <- dbGetQuery(coda19, statement = read_file(files_path[i]))
assign(files_names[i], tmp)
dflist[[i]] = tmp
write.csv(x = dflist[i], file = paste0(files_names[i], ".csv"))
}
The output of this step was all the aforementioned variables for all COVID 19 hospitalizations. A COVID 19 hospitalization was defined as any hospitalization episode of patient with a positive COVID-19 PCR test within 73 days of the admission. To reduce the number of features, drugs were mapped to MESH classes using the RxClass API and ICD10 codes were mapped to Charlson’s comorbidities.
3 different datasets were obtained at different timestamp from the admission : 24h, 48h, and 72h.
Here is a preview of the raw dataset at 24 hours.
For more details, take a look a the complete descriptive statistics in the following pdf files.
We removed all variables for which more than 35% of the data was missing, and all observations for which more than 35% of the variables were missing. We imputed the remaining data using simple CART imputation a methodology that is robust against outliers, multicollinearity and skewed distributions. Imputation was conducted using all variables except the death and the patient identifiers.
# Getting predictorMatrix and modifying it to exclude 2 variables from the imputation set
matrix_72h <- mice::mice(covid_72h2, method = "cart", m=1, maxit = 0)
pred_matrix72h <- matrix_72h$predictorMatrix
pred_matrix72h[,'patient_site_uid'] <- 0
pred_matrix72h[,'death'] <- 0
# Imputing
covidimputer <- mice::mice(covid_72h2, pred = pred_matrix72h, method = "cart", m=1)
covid72h_imputed <- complete(covidimputer, 1)
The description of the imputed datasets was also provided as pdf files. Briefly :
This will be done using the 72h dataset.
To select the most important variables we conducted PCA on the continuous variables and MCA on the categorical variables.
Because the dataset contained continuous variables in different format (min, max, mean), we used PCA to choose the subtype that mostly explained the variance in the data.
pca.coda19 <- PCA(coda19_continuous,
quali.sup = c(54, 55),
graph = FALSE)
fviz_pca_var(pca.coda19,
col.var = "contrib", # Color by contributions to the PC alternatives, color by cos2
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Warning: ggrepel: 41 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The PCA variable plot is messy but still shows use the most important continuous variables in our dataset.
To have a better appreciation of the results, below are the variables with most important cos2 on PC1 and PC2.
fviz_cos2(pca.coda19, choice = "var", axes = 1:2, top=20) # Seeing mots important cos2 variables on PC1 and PC2
However, when looking at the overall capacity of our principal components to explain the variation of our data, the 5 PCs only explain 44% of the variance.
head(pca.coda19$eig)
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 7.512449 14.174432 14.17443
## comp 2 4.643389 8.761111 22.93554
## comp 3 4.076958 7.692373 30.62792
## comp 4 3.950792 7.454324 38.08224
## comp 5 3.290256 6.208029 44.29027
## comp 6 3.112389 5.872432 50.16270
The similar exercise was conducted using MCA on the categorical variables
mca.coda19 <- MCA(coda19_categorical,
quali.sup = c(90), # Removing death from observation but MV kept
graph = FALSE)
fviz_mca_var(mca.coda19,
col.var = "contrib", # Color by contributions to the PC alternatives, color by cos2
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Warning: ggrepel: 187 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The MCA variable plot seems to show a pattern for ICU admitted patients as sedation, muscular blocking agents and mechanical ventilation are all in the same area of the plot. The other medications do not seem to form a particular group.
Looking at the percentage of variance, the results are not particularly powerful because of the curse of dimensionality : in other words, over 90 medications group were included in this portion of the analysis but some were only rarely found in the dataset.
print(head(mca.coda19$eig))
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.05085437 4.269848 4.269848
## dim 2 0.04052505 3.402575 7.672423
## dim 3 0.03257750 2.735281 10.407703
## dim 4 0.02909950 2.443260 12.850964
## dim 5 0.02667261 2.239493 15.090456
## dim 6 0.02565848 2.154344 17.244801
Once again, we identified the most contributive variables in order to determine which one would be kept for the second portion of the analysis.
N.B. Comorbidities ended not being candidate variables.
fviz_contrib(mca.coda19, choice = "var", axes = 1, top = 10)
As mentioned earlier, we selected the most important variables as it relates to the initial PCA and MCA analysis to respectively select the candidate continuous and categorical variables for our clustering effort.
We reconducted PCA and MCA with those selected variables only and showed an increase in the variance explained by the principal components validating our selection.
coda19_continuous_labs <- coda19%>%select(c("patient_age", "hemoglobin_min", "plt_max", "wbc_max", "neutrophil_max",
"lymph_min", "mono_max", "eos_min", "sodium_min",
"creatinine_max"))
coda19_continuous_vitals <- coda19%>%select(c("sbp_max", "sbp_min", "dbp_mean", "temp_max", "temp_min", "so2_min", "rr_mean"))
coda19_continuous2 <- coda19_continuous_labs%>%cbind(coda19_continuous_vitals)
coda19_continuous2$death <- coda19$death
coda19_continuous2$mv <- coda19$mv
coda19_continuous2$icu <- coda19$icu
# Doing PCA
pca.coda19.continuous <- PCA(coda19_continuous2,
quali.sup = c(18,19,20), #Removing ICU admission and Death from the PCA analysis
graph = FALSE)
fviz_pca_var(pca.coda19.continuous,
col.var = "contrib", # Color by contributions to the PC alternatives, color by cos2
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
kable(head(pca.coda19.continuous$eig))
| eigenvalue | percentage of variance | cumulative percentage of variance | |
|---|---|---|---|
| comp 1 | 2.099611 | 12.350656 | 12.35066 |
| comp 2 | 2.009039 | 11.817877 | 24.16853 |
| comp 3 | 1.649945 | 9.705557 | 33.87409 |
| comp 4 | 1.586820 | 9.334238 | 43.20833 |
| comp 5 | 1.231291 | 7.242885 | 50.45121 |
| comp 6 | 1.084188 | 6.377577 | 56.82879 |
# Selecting categorical data
coda19_categorical2 <- coda19_categorical%>%select("female", "male", "vasopressors", "glucocorticoids", "anti_bacterial_agents", "antifungal_agents",
"anti_ulcer_agents","immunosuppressive_agents", "diuretics", "platelet_aggregation_inhibitors",
"sedation","neuromuscular_blocking_agents", "bronchodilator_agents", "hiv_medication", "mv", "icu",
"death")
mca.coda19.categorical <- MCA(coda19_categorical2,
quali.sup = c(17), # Removing death from observation but MV kept
graph = FALSE)
As we can see below, the selected categorical variables are better at describing the variance of the data with cumulative percentage of 62% with 6 PCs.
head(mca.coda19.categorical$eig)
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.19791808 19.791808 19.79181
## dim 2 0.12379136 12.379136 32.17094
## dim 3 0.10032980 10.032980 42.20392
## dim 4 0.07400790 7.400790 49.60471
## dim 5 0.06788275 6.788275 56.39299
## dim 6 0.06330158 6.330158 62.72315
The MCA does distinguish some patterns. The biplot effectively shows that the selected categorical variables are able to distinguish 2 groups of patients : ward vs icu patients mostly based on the medication.
fviz_mca_biplot(mca.coda19.categorical,
geom.ind = "point", # show points only (nbut not "text")
col.ind = coda19_categorical2$icu, # color by death
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE, # Concentration ellipses
legend.title = "ICU",
repel = TRUE
)
## Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
### FAMD analysis including both continuous and categorical variables
The ultimate goal was to use all of the variables available continuous and categorical, in order to mine some underlying clinical patterns. To do so, we conducted a FAMD analysis, a component analysis that uses Gower distance to measure the dissimilarity between observations.
# Preparing the dataset to retain only the selected variables
coda19_complete <- coda19_continuous2%>%select(-c("death","icu","mv"))%>%cbind(coda19_categorical2)
The results are presented below :
famd.coda19 <- FAMD(coda19_complete,
sup.var = c(33, 34), # Removing icu and death from observation but MV kept
graph = FALSE)
print(famd.coda19$eig)
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.251239 10.160121 10.16012
## comp 2 2.412674 7.539605 17.69973
## comp 3 2.227780 6.961813 24.66154
## comp 4 1.894348 5.919836 30.58138
## comp 5 1.650218 5.156930 35.73831
fviz_famd_ind(famd.coda19,
axes = c(1,2),
geom.ind = "point",
repel = TRUE,
label = "none",
col.ind = as.factor(coda19_complete$icu),
palette = c("#0073C2FF", "#EFC000FF", "#868686FF", "#4A6990FF"),
addEllipses = TRUE,
legend.title = "ICU")
## Warning in if (col.ind %in% c("cos2", "contrib", "coord")) partial = NULL: the
## condition has length > 1 and only the first element will be used
fviz_famd_ind(famd.coda19,
axes = c(1,2),
geom.ind = "point",
repel = TRUE,
label = "none",
col.ind = as.factor(coda19_complete$death),
palette = c("#0073C2FF", "#EFC000FF", "#868686FF", "#4A6990FF"),
addEllipses = TRUE,
legend.title = "DEATH")
## Warning in if (col.ind %in% c("cos2", "contrib", "coord")) partial = NULL: the
## condition has length > 1 and only the first element will be used
Visually, the FAMD is not linearly able to separate the data into clusters of clinical outcomes : icu admission or death. Bu several aspects mut consider here : 1) the data is not linearly separable, therefore FAMD might simply not be adequate to conduct the analysis in the first place 2) There are other clusters not related to clinical outcomes that can be identified.
The next step of the project was to conduct cluster analysis and identifying clinical phenotypes based on the selected variables conducting PCA and CMA.
The objective of clustering is to identify patterns and clusters within the data. It is a well documented unsupervised machine learning method used in multiple fields of molecular biology.
In our case, we leveraged this technique to identify COVID 19 phenotype using the 31 selected variables. More precisely, we used PAM clustering, a method that analogous to K-Means clustering but that uses medoids instead of centroids and a method that is compatible with mixed data types as in our case.
# Calculating Gower Distance
coda19_gower <- cluster::daisy(coda19_complete%>%select(-c("death", "icu")), metric = "gower") #Removing death and icu admission from the clustering
# Silhouette analysis for values of K between 2 and 10
set.seed(123)
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(coda19_gower,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
plot(1:10, sil_width,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:10, sil_width)
We chose k=3 as the appropriate numbers of clusters for our dataset as it yielded the highest silhouette width.
pam_fit <- pam(coda19_gower, diss = TRUE, 3)
pam_results <- coda19_complete %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
clusters <- factor(pam_fit$clustering)
Once the PAM clustering was conducted, we assessed its validity through :
set.seed(10)
tsne_obj <- Rtsne(coda19_gower, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = as.factor(pam_fit$clustering))
p <- ggplot(aes(x = X, y = Y,), data = tsne_data) +
geom_point(aes(color = cluster))+
ggtitle("3 K PAM clustering")
show(p)
results <- coda19_complete%>%mutate(cluster = pam_fit$clustering)
The following patients represent the medoids :
coda19_complete[pam_fit$medoids, ]
## patient_age hemoglobin_min plt_max wbc_max neutrophil_max lymph_min
## 31 63 128 228 7.5 6.13 0.85
## 295 64 94 196 3.0 1.86 0.89
## 179 71 123 310 5.6 4.87 0.34
## mono_max eos_min sodium_min creatinine_max sbp_max sbp_min dbp_mean
## 31 0.60 0.00 138 49 128 108 67.75
## 295 0.18 0.03 139 54 147 101 72.13
## 179 0.57 0.09 139 55 152 103 71.29
## temp_max temp_min so2_min rr_mean female male vasopressors glucocorticoids
## 31 38.1 36.4 88 20.00 0 1 0 0
## 295 37.8 36.3 86 19.69 0 1 0 0
## 179 37.0 36.1 91 19.60 1 0 0 0
## anti_bacterial_agents antifungal_agents anti_ulcer_agents
## 31 1 0 0
## 295 1 0 1
## 179 1 0 0
## immunosuppressive_agents diuretics platelet_aggregation_inhibitors sedation
## 31 0 0 0 0
## 295 0 0 0 0
## 179 0 0 0 0
## neuromuscular_blocking_agents bronchodilator_agents hiv_medication mv icu
## 31 0 0 0 0 0
## 295 0 0 0 0 0
## 179 0 0 0 0 0
## death
## 31 0
## 295 0
## 179 0
A pdf file with the distribution of variables accross clusters has been provided.
# This chunk codes exports descriptive statistics according to clusters
autoEDA_results_k <- autoEDA(results,
y = "cluster", returnPlotList = TRUE,
outcomeType = "automatic", removeConstant = TRUE,
removeZeroSpread = TRUE, removeMajorityMissing = TRUE,
imputeMissing = TRUE, clipOutliers = FALSE,
minLevelPercentage = 0.025, predictivePower = TRUE,
outlierMethod = "tukey", lowPercentile = 0.01,
upPercentile = 0.99, plotCategorical = "stackedBar",
plotContinuous = "boxplot", bins = 30,
transparency = 0.7,
rotateLabels = TRUE, color = "#26A69A",
verbose = FALSE)
p <- plot_grid(plotlist = autoEDA_results_k$plots, ncol = 3)
ggsave("cluster_analysis_3k_v2.svg", width=20,height=84, limitsize = FALSE)
Below a table one of the different variables across the clusters.
tableone::kableone(tableOne_compared)%>%
kable_classic(full_width = F, html_font = "Computer Modern")
| 1 | 2 | 3 | p | test | |
|---|---|---|---|---|---|
| n | 103 | 106 | 183 | ||
| patient_age (mean (SD)) | 67.02 (18.09) | 70.08 (15.65) | 69.95 (20.69) | 0.385 | |
| hemoglobin_min (mean (SD)) | 120.40 (17.93) | 112.25 (21.17) | 109.90 (19.54) | <0.001 | |
| plt_max (mean (SD)) | 237.87 (108.41) | 218.57 (101.98) | 267.91 (124.63) | 0.001 | |
| wbc_max (mean (SD)) | 7.90 (6.05) | 10.97 (23.93) | 8.93 (4.91) | 0.231 | |
| neutrophil_max (mean (SD)) | 6.02 (4.79) | 6.45 (3.98) | 6.91 (4.45) | 0.257 | |
| lymph_min (mean (SD)) | 1.04 (1.66) | 0.83 (0.57) | 1.05 (0.70) | 0.176 | |
| mono_max (mean (SD)) | 0.66 (0.35) | 0.74 (0.96) | 0.70 (0.44) | 0.635 | |
| eos_min (mean (SD)) | 0.03 (0.06) | 0.05 (0.10) | 0.06 (0.17) | 0.251 | |
| sodium_min (mean (SD)) | 128.46 (29.81) | 130.53 (25.92) | 132.54 (24.95) | 0.452 | |
| creatinine_max (mean (SD)) | 133.83 (153.66) | 149.09 (174.37) | 118.26 (152.60) | 0.277 | |
| sbp_max (mean (SD)) | 144.94 (19.76) | 151.37 (19.63) | 148.93 (23.33) | 0.092 | |
| sbp_min (mean (SD)) | 110.94 (14.59) | 112.31 (16.88) | 110.80 (17.45) | 0.740 | |
| dbp_mean (mean (SD)) | 73.73 (11.32) | 71.92 (7.74) | 70.79 (8.57) | 0.035 | |
| temp_max (mean (SD)) | 37.72 (1.01) | 37.83 (0.91) | 37.64 (1.01) | 0.320 | |
| temp_min (mean (SD)) | 36.18 (0.64) | 36.07 (0.62) | 36.15 (0.61) | 0.369 | |
| so2_min (mean (SD)) | 87.42 (15.87) | 88.42 (10.52) | 90.37 (7.60) | 0.076 | |
| rr_mean (mean (SD)) | 20.73 (2.58) | 21.05 (3.57) | 20.40 (2.53) | 0.174 | |
| female = 1 (%) | 0 ( 0.0) | 0 ( 0.0) | 183 (100.0) | <0.001 | |
| male = 1 (%) | 103 (100.0) | 106 (100.0) | 0 ( 0.0) | <0.001 | |
| vasopressors = 1 (%) | 13 ( 12.6) | 30 ( 28.3) | 26 ( 14.2) | 0.003 | |
| glucocorticoids = 1 (%) | 7 ( 6.8) | 20 ( 18.9) | 17 ( 9.3) | 0.012 | |
| anti_bacterial_agents = 1 (%) | 65 ( 63.1) | 80 ( 75.5) | 103 ( 56.3) | 0.005 | |
| antifungal_agents = 1 (%) | 0 ( 0.0) | 1 ( 0.9) | 1 ( 0.5) | 0.630 | |
| anti_ulcer_agents = 1 (%) | 0 ( 0.0) | 106 (100.0) | 85 ( 46.4) | <0.001 | |
| immunosuppressive_agents = 1 (%) | 0 ( 0.0) | 5 ( 4.7) | 4 ( 2.2) | 0.074 | |
| diuretics = 1 (%) | 13 ( 12.6) | 28 ( 26.4) | 44 ( 24.0) | 0.031 | |
| platelet_aggregation_inhibitors = 1 (%) | 15 ( 14.6) | 42 ( 39.6) | 37 ( 20.2) | <0.001 | |
| sedation = 1 (%) | 1 ( 1.0) | 23 ( 21.7) | 13 ( 7.1) | <0.001 | |
| neuromuscular_blocking_agents = 1 (%) | 0 ( 0.0) | 1 ( 0.9) | 0 ( 0.0) | 0.259 | |
| bronchodilator_agents = 1 (%) | 25 ( 24.3) | 27 ( 25.5) | 42 ( 23.0) | 0.887 | |
| hiv_medication = 1 (%) | 2 ( 1.9) | 1 ( 0.9) | 2 ( 1.1) | 0.777 | |
| mv = 1 (%) | 0 ( 0.0) | 8 ( 7.5) | 4 ( 2.2) | 0.004 | |
| icu = 1 (%) | 14 ( 13.6) | 40 ( 37.7) | 24 ( 13.1) | <0.001 | |
| death = 1 (%) | 19 ( 18.4) | 34 ( 32.1) | 41 ( 22.4) | 0.055 | |
| cluster (mean (SD)) | 1.00 (0.00) | 2.00 (0.00) | 3.00 (0.00) | <0.001 |
In the following section, we experimented with different iteration of the dataset to observe its impact on clustering.
# Calculating Gower Distance
coda19_gower_nogender <- cluster::daisy(coda19_complete%>%select(-c('male','female', 'icu', 'death')) , metric = "gower")
# Removing gender and outcome variables from the clustering
# Clustering PAM
pam_fit_nogender <- pam(coda19_gower_nogender, diss = TRUE, 3)
# Plotting on TSNE
tsne_obj <- Rtsne(coda19_gower_nogender, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = as.factor(pam_fit_nogender$clustering))
p <- ggplot(aes(x = X, y = Y,), data = tsne_data) +
geom_point(aes(color = cluster))+
ggtitle("3 K PAM clustering")
show(p)
results_nogender <- coda19_complete%>%mutate(cluster = pam_fit_nogender$clustering)
alllist <- names(results_nogender)
catlist <- names(results_nogender[sapply(results_nogender, is.factor)])
tableOne_compared <- tableone::CreateTableOne(vars = alllist, data = results_nogender, strata = "cluster", factorVars = catlist, test=TRUE)
tableone::kableone(tableOne_compared)%>%
kable_classic(full_width = F, html_font = "Computer Modern")
| 1 | 2 | 3 | p | test | |
|---|---|---|---|---|---|
| n | 189 | 143 | 60 | ||
| patient_age (mean (SD)) | 65.65 (18.18) | 73.29 (20.27) | 70.72 (14.39) | 0.001 | |
| hemoglobin_min (mean (SD)) | 115.26 (19.47) | 115.24 (19.17) | 102.45 (20.56) | <0.001 | |
| plt_max (mean (SD)) | 231.26 (102.94) | 250.61 (108.09) | 285.83 (159.52) | 0.006 | |
| wbc_max (mean (SD)) | 9.75 (18.39) | 7.70 (4.54) | 11.10 (5.37) | 0.184 | |
| neutrophil_max (mean (SD)) | 6.51 (4.43) | 5.68 (4.15) | 8.75 (4.35) | <0.001 | |
| lymph_min (mean (SD)) | 0.95 (1.26) | 1.19 (0.74) | 0.64 (0.55) | 0.001 | |
| mono_max (mean (SD)) | 0.70 (0.74) | 0.66 (0.32) | 0.84 (0.64) | 0.137 | |
| eos_min (mean (SD)) | 0.03 (0.07) | 0.07 (0.14) | 0.07 (0.22) | 0.022 | |
| sodium_min (mean (SD)) | 129.98 (27.01) | 132.62 (25.64) | 129.87 (27.46) | 0.634 | |
| creatinine_max (mean (SD)) | 121.49 (140.16) | 106.24 (121.80) | 217.92 (245.17) | <0.001 | |
| sbp_max (mean (SD)) | 145.33 (20.59) | 148.20 (21.22) | 159.47 (22.09) | <0.001 | |
| sbp_min (mean (SD)) | 108.33 (13.93) | 114.31 (16.66) | 113.13 (21.85) | 0.003 | |
| dbp_mean (mean (SD)) | 70.87 (8.80) | 73.05 (9.96) | 72.20 (8.48) | 0.098 | |
| temp_max (mean (SD)) | 37.99 (0.96) | 37.22 (0.87) | 38.02 (0.89) | <0.001 | |
| temp_min (mean (SD)) | 36.27 (0.59) | 36.07 (0.64) | 35.89 (0.59) | <0.001 | |
| so2_min (mean (SD)) | 88.03 (12.46) | 92.34 (4.65) | 84.53 (15.00) | <0.001 | |
| rr_mean (mean (SD)) | 21.06 (3.06) | 19.91 (1.93) | 21.23 (3.65) | <0.001 | |
| female = 1 (%) | 75 (39.7) | 80 (55.9) | 28 ( 46.7) | 0.013 | |
| male = 1 (%) | 114 (60.3) | 63 (44.1) | 32 ( 53.3) | 0.013 | |
| vasopressors = 1 (%) | 14 ( 7.4) | 9 ( 6.3) | 46 ( 76.7) | <0.001 | |
| glucocorticoids = 1 (%) | 24 (12.7) | 9 ( 6.3) | 11 ( 18.3) | 0.031 | |
| anti_bacterial_agents = 1 (%) | 188 (99.5) | 0 ( 0.0) | 60 (100.0) | <0.001 | |
| antifungal_agents = 1 (%) | 1 ( 0.5) | 0 ( 0.0) | 1 ( 1.7) | 0.314 | |
| anti_ulcer_agents = 1 (%) | 78 (41.3) | 56 (39.2) | 57 ( 95.0) | <0.001 | |
| immunosuppressive_agents = 1 (%) | 6 ( 3.2) | 0 ( 0.0) | 3 ( 5.0) | 0.051 | |
| diuretics = 1 (%) | 21 (11.1) | 20 (14.0) | 44 ( 73.3) | <0.001 | |
| platelet_aggregation_inhibitors = 1 (%) | 40 (21.2) | 32 (22.4) | 22 ( 36.7) | 0.042 | |
| sedation = 1 (%) | 0 ( 0.0) | 3 ( 2.1) | 34 ( 56.7) | <0.001 | |
| neuromuscular_blocking_agents = 1 (%) | 0 ( 0.0) | 0 ( 0.0) | 1 ( 1.7) | 0.062 | |
| bronchodilator_agents = 1 (%) | 49 (25.9) | 27 (18.9) | 18 ( 30.0) | 0.163 | |
| hiv_medication = 1 (%) | 2 ( 1.1) | 1 ( 0.7) | 2 ( 3.3) | 0.291 | |
| mv = 1 (%) | 0 ( 0.0) | 0 ( 0.0) | 12 ( 20.0) | <0.001 | |
| icu = 1 (%) | 26 (13.8) | 10 ( 7.0) | 42 ( 70.0) | <0.001 | |
| death = 1 (%) | 46 (24.3) | 24 (16.8) | 24 ( 40.0) | 0.002 | |
| cluster (mean (SD)) | 1.00 (0.00) | 2.00 (0.00) | 3.00 (0.00) | <0.001 |
# Loading the ICU dataset
# covidicu72h_imputed <- read_csv (./"covidicu72h_imputed_filtered.csv)
# Converting appropriate columns to factor
covidicu72h_imputed <- covidicu72h_imputed%>%mutate_if(function(x){(length(unique(x)) <= 3)}, function(x){as.factor(x)})
# Calculating Gower Distance
coda19_gower_icu <- cluster::daisy(covidicu72h_imputed%>%select(-c('male','female','death')) , metric = "gower")
# Removing gender and outcome variables from the clustering
# Optimal K
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(coda19_gower_icu,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
plot(1:10, sil_width,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:10, sil_width)
# Clustering PAM
pam_fit_icu <- pam(coda19_gower_icu, diss = TRUE, 3)
set.seed(25)
# Plotting on TSNE
tsne_obj <- Rtsne(coda19_gower_icu, is_distance = TRUE, perplexity = 15)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = as.factor(pam_fit_icu$clustering))
p <- ggplot(aes(x = X, y = Y,), data = tsne_data) +
geom_point(aes(color = cluster))+
ggtitle("3 K PAM clustering")
show(p)
results_icu <- covidicu72h_imputed%>%mutate(cluster = pam_fit_icu$clustering)
alllist <- names(results_icu)
catlist <- names(results_icu[sapply(results_icu, is.factor)])
tableOne_compared <- tableone::CreateTableOne(vars = alllist, data = results_icu, strata = "cluster", factorVars = catlist, test=TRUE)
tableone::kableone(tableOne_compared)%>%
kable_classic(full_width = F, html_font = "Computer Modern")
| 1 | 2 | 3 | p | test | |
|---|---|---|---|---|---|
| n | 27 | 17 | 34 | ||
| patient_age (mean (SD)) | 63.22 (14.15) | 55.24 (15.81) | 66.41 (10.54) | 0.020 | |
| hemoglobin_min (mean (SD)) | 115.81 (15.80) | 99.71 (21.94) | 104.71 (19.78) | 0.016 | |
| plt_max (mean (SD)) | 265.74 (122.72) | 199.00 (96.95) | 304.18 (116.24) | 0.011 | |
| wbc_max (mean (SD)) | 9.03 (5.37) | 9.94 (5.14) | 11.64 (4.90) | 0.138 | |
| neutrophil_max (mean (SD)) | 7.24 (5.36) | 7.69 (4.11) | 9.14 (3.41) | 0.209 | |
| lymph_min (mean (SD)) | 0.75 (0.35) | 0.47 (0.21) | 0.60 (0.43) | 0.041 | |
| mono_max (mean (SD)) | 0.65 (0.41) | 0.71 (0.57) | 0.75 (0.52) | 0.769 | |
| eos_min (mean (SD)) | 0.02 (0.05) | 0.01 (0.01) | 0.06 (0.11) | 0.050 | |
| sodium_min (mean (SD)) | 123.37 (37.06) | 135.76 (2.68) | 125.44 (35.99) | 0.439 | |
| chloride_min (mean (SD)) | 94.59 (20.09) | 101.41 (3.20) | 92.06 (24.60) | 0.301 | |
| albumin_min (mean (SD)) | 30.22 (4.47) | 27.76 (5.25) | 26.47 (3.75) | 0.005 | |
| bicarbonate_min (mean (SD)) | 23.27 (2.93) | 21.04 (4.35) | 19.96 (3.67) | 0.003 | |
| bun_max (mean (SD)) | 8.48 (5.76) | 11.22 (9.00) | 14.59 (9.17) | 0.017 | |
| glucose_max (mean (SD)) | 8.83 (4.32) | 9.98 (3.56) | 13.60 (4.61) | <0.001 | |
| anion_gap_mean (mean (SD)) | 10.21 (2.39) | 8.98 (1.73) | 11.00 (2.78) | 0.025 | |
| ptt_max (mean (SD)) | 28.85 (7.50) | 42.35 (30.16) | 49.00 (30.00) | 0.009 | |
| alt_max (mean (SD)) | 62.93 (51.74) | 37.76 (25.11) | 45.09 (50.39) | 0.172 | |
| ast_max (mean (SD)) | 78.89 (65.84) | 56.88 (30.61) | 64.26 (50.61) | 0.366 | |
| bili_tot_max (mean (SD)) | 11.70 (7.62) | 11.47 (7.38) | 12.76 (7.55) | 0.796 | |
| tropot_max (mean (SD)) | 46.44 (75.46) | 91.41 (122.86) | 237.56 (742.08) | 0.303 | |
| lipase_max (mean (SD)) | 61.59 (59.71) | 131.35 (231.82) | 56.62 (42.26) | 0.080 | |
| ck_max (mean (SD)) | 808.74 (1444.08) | 535.65 (660.21) | 1064.09 (1555.87) | 0.423 | |
| weight_mean (mean (SD)) | 75.89 (14.90) | 80.48 (9.89) | 86.74 (18.58) | 0.032 | |
| fio2_max (mean (SD)) | 54.78 (29.31) | 68.18 (27.21) | 88.68 (17.72) | <0.001 | |
| lactate_max (mean (SD)) | 1.73 (1.41) | 1.63 (1.60) | 1.96 (1.00) | 0.647 | |
| svo2sat_min (mean (SD)) | 50.33 (22.84) | 62.12 (18.38) | 60.29 (21.02) | 0.113 | |
| temp_max (mean (SD)) | 38.41 (0.86) | 38.49 (0.88) | 37.86 (0.61) | 0.005 | |
| female = 1 (%) | 7 (25.9) | 4 (23.5) | 13 (38.2) | 0.448 | |
| male = 1 (%) | 20 (74.1) | 13 (76.5) | 21 (61.8) | 0.448 | |
| vasopressors = 1 (%) | 4 (14.8) | 14 (82.4) | 30 (88.2) | <0.001 | |
| glucocorticoids = 1 (%) | 2 ( 7.4) | 2 (11.8) | 8 (23.5) | 0.200 | |
| anti_bacterial_agents = 1 (%) | 21 (77.8) | 15 (88.2) | 32 (94.1) | 0.164 | |
| antifungal_agents = 1 (%) | 1 ( 3.7) | 0 ( 0.0) | 0 ( 0.0) | 0.384 | |
| immunosuppressive_agents = 1 (%) | 1 ( 3.7) | 0 ( 0.0) | 3 ( 8.8) | 0.370 | |
| diuretics = 1 (%) | 2 ( 7.4) | 1 ( 5.9) | 29 (85.3) | <0.001 | |
| platelet_aggregation_inhibitors = 1 (%) | 7 (25.9) | 4 (23.5) | 10 (29.4) | 0.896 | |
| sedation = 1 (%) | 0 ( 0.0) | 11 (64.7) | 26 (76.5) | <0.001 | |
| neuromuscular_blocking_agents = 1 (%) | 0 ( 0.0) | 0 ( 0.0) | 1 ( 2.9) | 0.519 | |
| bronchodilator_agents = 1 (%) | 5 (18.5) | 2 (11.8) | 6 (17.6) | 0.825 | |
| hiv_medication = 1 (%) | 1 ( 3.7) | 0 ( 0.0) | 1 ( 2.9) | 0.738 | |
| mv = 1 (%) | 0 ( 0.0) | 2 (11.8) | 10 (29.4) | 0.006 | |
| death = 1 (%) | 1 ( 3.7) | 6 (35.3) | 18 (52.9) | <0.001 | |
| cluster (mean (SD)) | 1.00 (0.00) | 2.00 (0.00) | 3.00 (0.00) | <0.001 |
covidicu72h_imputed_cont <- covidicu72h_imputed%>%select(c("patient_age", "hemoglobin_min", "plt_max", "wbc_max", "neutrophil_max",
"lymph_min", "mono_max", "eos_min", "sodium_min", "chloride_min", "albumin_min", "bicarbonate_min",
"bun_max", "glucose_max", "anion_gap_mean", "ptt_max", "alt_max", "ast_max", "bili_tot_max",
"tropot_max",
"lipase_max", "ck_max", "weight_mean", "fio2_max","lactate_max", "svo2sat_min", "temp_max"
,"female", "male", "mv", "death"))
pca.coda19 <- PCA(covidicu72h_imputed_cont,
quali.sup = c(28,29, 30,31), # cat variables from the dataset not considered : death, mv, female/male
graph = FALSE)
coda19.kmeans <- kmeans(covidicu72h_imputed_cont[, 1:28], centers = 3, nstart = 60)
coda19.obs.clusters <- as.factor(coda19.kmeans$cluster)
# Color variables by groups
fviz_pca_ind(pca.coda19,
geom.ind = "point",
col.ind = coda19.obs.clusters,
palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
addEllipses = FALSE,
legend.title = "Cluster")
results_icu <- covidicu72h_imputed_cont%>%mutate(cluster = coda19.obs.clusters)
alllist <- names(results_icu)
catlist <- names(results_icu[sapply(results_icu, is.factor)])
tableOne_compared <- tableone::CreateTableOne(vars = alllist, data = results_icu, strata = "cluster", factorVars = catlist, test=TRUE)
tableone::kableone(tableOne_compared)%>%
kable_classic(full_width = F, html_font = "Computer Modern")
| 1 | 2 | 3 | p | test | |
|---|---|---|---|---|---|
| n | 1 | 9 | 68 | ||
| patient_age (mean (SD)) | 65.00 (NA) | 67.33 (11.02) | 62.25 (13.99) | NA | |
| hemoglobin_min (mean (SD)) | 96.00 (NA) | 106.33 (21.81) | 107.78 (19.82) | NA | |
| plt_max (mean (SD)) | 354.00 (NA) | 271.44 (107.31) | 266.22 (122.99) | NA | |
| wbc_max (mean (SD)) | 12.40 (NA) | 14.67 (7.48) | 9.77 (4.62) | NA | |
| neutrophil_max (mean (SD)) | 11.59 (NA) | 11.50 (4.43) | 7.67 (4.18) | NA | |
| lymph_min (mean (SD)) | 0.00 (NA) | 0.44 (0.25) | 0.66 (0.38) | NA | |
| mono_max (mean (SD)) | 1.36 (NA) | 0.95 (0.61) | 0.66 (0.47) | NA | |
| eos_min (mean (SD)) | 0.23 (NA) | 0.05 (0.09) | 0.03 (0.08) | NA | |
| sodium_min (mean (SD)) | 141.00 (NA) | 124.56 (38.60) | 127.09 (31.89) | NA | |
| chloride_min (mean (SD)) | 103.00 (NA) | 89.78 (28.23) | 95.54 (19.28) | NA | |
| albumin_min (mean (SD)) | 22.00 (NA) | 25.78 (4.29) | 28.44 (4.57) | NA | |
| bicarbonate_min (mean (SD)) | 28.50 (NA) | 21.11 (3.69) | 21.27 (3.82) | NA | |
| bun_max (mean (SD)) | 10.60 (NA) | 15.64 (9.53) | 11.24 (8.30) | NA | |
| glucose_max (mean (SD)) | 7.40 (NA) | 10.88 (2.88) | 11.25 (5.01) | NA | |
| anion_gap_mean (mean (SD)) | 7.00 (NA) | 11.47 (2.84) | 10.18 (2.47) | NA | |
| ptt_max (mean (SD)) | 31.00 (NA) | 40.44 (14.45) | 40.74 (27.35) | NA | |
| alt_max (mean (SD)) | 67.00 (NA) | 75.67 (59.20) | 45.97 (45.06) | NA | |
| ast_max (mean (SD)) | 127.00 (NA) | 128.78 (63.03) | 58.76 (46.31) | NA | |
| bili_tot_max (mean (SD)) | 17.00 (NA) | 18.44 (12.90) | 11.21 (6.13) | NA | |
| tropot_max (mean (SD)) | 4340.00 (NA) | 216.22 (287.34) | 67.63 (82.23) | NA | |
| lipase_max (mean (SD)) | 84.00 (NA) | 101.78 (59.18) | 70.90 (124.96) | NA | |
| ck_max (mean (SD)) | 325.00 (NA) | 4340.67 (1194.88) | 407.79 (380.58) | NA | |
| weight_mean (mean (SD)) | 62.00 (NA) | 83.17 (19.47) | 81.71 (15.97) | NA | |
| fio2_max (mean (SD)) | 100.00 (NA) | 86.78 (27.99) | 70.18 (28.13) | NA | |
| lactate_max (mean (SD)) | 1.60 (NA) | 2.27 (1.38) | 1.75 (1.28) | NA | |
| svo2sat_min (mean (SD)) | 88.00 (NA) | 57.33 (24.30) | 56.78 (21.12) | NA | |
| temp_max (mean (SD)) | 38.60 (NA) | 38.24 (0.49) | 38.17 (0.85) | NA | |
| female = 1 (%) | 0 ( 0.0) | 2 ( 22.2) | 22 ( 32.4) | 0.659 | |
| male = 1 (%) | 1 (100.0) | 7 ( 77.8) | 46 ( 67.6) | 0.659 | |
| mv = 1 (%) | 1 (100.0) | 5 ( 55.6) | 6 ( 8.8) | <0.001 | |
| death = 1 (%) | 0 ( 0.0) | 3 ( 33.3) | 22 ( 32.4) | 0.786 | |
| cluster (%) | <0.001 | ||||
| 1 | 1 (100.0) | 0 ( 0.0) | 0 ( 0.0) | ||
| 2 | 0 ( 0.0) | 9 (100.0) | 0 ( 0.0) | ||
| 3 | 0 ( 0.0) | 0 ( 0.0) | 68 (100.0) |